Create scale/subscale scores from items; reverse‑code where needed before aggregating.
Prefer rowMeans() (optionally multiply by #items for “sum” scales) to use available data while tolerating a small amount of missingness.
Use reliability checks (e.g., Cronbach’s α) and check.keys = TRUE to detect/handle reverse‑keyed items.
Building Plots with the Grammar of Graphics
Start with ggplot(data, aes(...)), then add layers (+ geom_*) and optionally stat_*, scales, coordinates, faceting, and themes.
Map additional variables to aesthetics (color/fill/shape/size/alpha) to add information without clutter.
Distinguish mapping (variable-driven) from setting (constant) aesthetics to avoid confusing legends.
Choosing Univariate Displays
Histogram / Dotplot: Show distribution via bins; bin width controls granularity.
Density plot: Smooth distribution view (area = 1); useful alongside histograms.
Box/Violin: Compact summaries and distribution shape; robust to outliers.
Q–Q plot: Visual normality check; use with context (many analyses don’t require variables themselves to be normal, but residuals in some models do).
Checking Distributions & Outliers
Overlay theoretical curves or use diagnostic helpers to compare observed vs. reference distributions and to spot outliers.
Use sensible axis breaks (e.g., quantiles) to aid interpretation.
Bivariate, Categorical, and Mixed Displays
Scatterplots: Consider overplotting remedies (alpha, jitter); shapes can be hard to distinguish—use them judiciously or customize with manual scales.
Bar charts: For pre‑summarized data use stat = "identity"; otherwise map raw data and let stat summarize.
Percent/Proportion views: Compute proportions and confidence intervals before plotting to avoid misleading scales.
Mixed continuous × categorical: Combine raw points (e.g., dot/jitter) with summaries (means & CIs), flip coordinates for readability, and facet across groups for clear comparisons.
Presentation & Clarity
Set factor levels/labels early; keep legends informative (titles, breaks, labels).
Use themes and minimalist frames to focus attention on data (e.g., range frames, Tufte‑style).
Key Functions or Operators
Data & Scoring: rowMeans(), factor(), as.data.table(), psych::alpha(..., check.keys = TRUE)
The following object is masked from 'package:JWileymisc':
cor2cov
library(ggplot2)
Attaching package: 'ggplot2'
The following objects are masked from 'package:psych':
%+%, alpha
library(ggpubr)
library(ggthemes)
Attaching package: 'ggthemes'
The following object is masked from 'package:JWileymisc':
theme_tufte
library(scales)
Attaching package: 'scales'
The following objects are masked from 'package:psych':
alpha, rescale
library(ggExtra)

## turn off some notes from R in the final HTML document
knitr::opts_chunk$set(message = FALSE)
2 Data
To start with, we’ll load some data. Here we are going to use the data from the data collection exercise that previous honours cohorts did. For this code to work, you need to have the datasets in the same directory as this Rmarkdown file. If you pulled directly from git, then they should be there (but check).
The read_sav() function from the haven package lets you read SPSS (i.e., *.sav) data files into R, and then we use the as.data.table() function to convert the datasets into data.tables named db for baseline and dd for the daily data.
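That loading step would look something like the sketch below. The .sav file names here are placeholders, not the actual file names from the data collection exercise — swap in whatever the baseline and daily files in your directory are called.

```r
library(haven)       # read_sav() reads SPSS .sav files
library(data.table)  # as.data.table() converts data frames to data.tables

## file names below are hypothetical placeholders
db <- as.data.table(read_sav("baseline.sav"))  # baseline data
dd <- as.data.table(read_sav("daily.sav"))     # daily data
```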
When we are working with our own data, often we have to perform some data management. For example the Positive and Negative Affect Schedule (PANAS) has different items (adjectives) capturing words related to positive and negative emotions. However, we do not typically analyze these individually. The individual items are scored into two subscales: positive and negative affect. We may use digital (e.g., Qualtrics, REDCap) or paper and pencil surveys, but those typically only provide us with the scores on each item. We have to create the subscale scores on our own. Put simply, we compute variable scores from item scores.
Almost always in psychology, scales/subscales are scored in one of two ways: either the individual items are simply added together or the individual items are averaged. Sometimes items are worded in the opposite direction and must first be reverse coded and then added/averaged. Let’s look at a few ways of doing this in R.
The simplest approach to adding items together is to literally list each item variable name separated by + for addition. This will work to create a total score. However, a downside is that under this approach if a participant misses any single item, they will be missing on the entire subscale.
Another approach is to use the rowMeans() function. This function allows you to perform the calculation (take the mean for a row of data) excluding missing data if desired, which equates to imputing the mean for an individual for any missing items. This is commonly done and, if there are only small amounts of missing data (e.g., someone completed 18 of 20 items and missed just 2), is probably a good idea.
If you want to deal with missing data but need a total score, you can use rowMeans() but multiply the results by the number of items that should have been completed (e.g., 10 if a scale has 10 items).
As a rule of thumb, I recommend using rowMeans(), and if the scale is typically added up, multiplying by the number of items. This uses the most available data and, in my experience with raw data, is sensible most of the time.
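To see the practical difference between the two approaches, here is a small self-contained sketch with made-up item responses (the item names i1–i4 are hypothetical, not from our data):

```r
## two people's responses to a hypothetical 4-item scale; person 2 skipped i2
d <- data.frame(i1 = c(2, 3), i2 = c(4, NA), i3 = c(3, 2), i4 = c(1, 4))
items <- c("i1", "i2", "i3", "i4")

## literal addition: any missing item makes the whole score missing
d$totalADD <- d$i1 + d$i2 + d$i3 + d$i4

## rowMeans() with na.rm = TRUE averages the items that were answered,
## and multiplying by the number of items returns to the "sum" metric
d$totalAVG <- rowMeans(d[, items], na.rm = TRUE)
d$totalSUM <- d$totalAVG * length(items)

d  ## person 2 is NA on totalADD but still gets usable totalAVG / totalSUM scores
```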
Note
Often in R, you will see short and long ways of doing the same thing. This may seem confusing at first, but you get used to it. The reason is that there are some circumstances where you really do need the long way, but mostly the short way is quicker and easier. For example, if you have a group of friends, but only one ‘Jane’ in the group, you probably just say ‘Jane’. However, if your friend group grows and you end up with two ‘Janes’ then you might call one ‘Jane T’ or ‘new Jane’ or find some other way to indicate which ‘Jane’ you are talking about. The same applies in R. There are usually shortcuts which are the default, but now and then you will get in a situation where you need a longer or more specific way of referring to a function or of accomplishing a specific task that cannot be managed with the short simple way.
Let’s look at each of these for the variable perceived stress from the baseline questionnaire of our data collection exercise. Finally, it’s standard to report the internal consistency or reliability of a scale. A common measure is Cronbach’s alpha, which we can get using the alpha() function from the psych package. We do something new here by writing psych::alpha(): that is a longer way of typing it which tells R we want the alpha() function from the psych package specifically. Mostly we don’t need to do this. However, the ggplot2 package also has a function called alpha(), which is used for the alpha transparency level in plots. Because we are using two packages with the same function name, R can get confused, so we write the package name in front of the function to be explicit about which one to use. If we were only using the psych package or only the ggplot2 package, this would not be necessary.
Note
You might have noticed something new: we used .SD. That’s also a special symbol in data.table that means the currently selected data. Well, what IS the currently selected data? Whatever rows we picked and whichever columns are specified by .SDcols, which we also listed at the end. That is needed because rowMeans() expects to be given a data set, not individual variables, but we are calling it already within db, a dataset, so we need some way of referring to a subset of the dataset within the data.table, and the way we do that is with .SD. It’s okay if that doesn’t make sense right now; copying and pasting the code and knowing where to change the variable names is enough for this unit. More: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-sd-usage.html
## add items together, if missing any item, missing Stress
db[, StressADD := PSS1 + (6 - PSS2r) + (6 - PSS3r) + PSS4]

## Or I could recode the variables PSS2r and PSS3r first
db[, PSS2 := 6 - PSS2r]
db[, PSS3 := 6 - PSS3r]

## average items
db[, StressAVG := rowMeans(.SD, na.rm = TRUE),
   .SDcols = c("PSS1", "PSS2", "PSS3", "PSS4")]

## average items then multiply to get back to "sum" scale
db[, Stress := rowMeans(.SD, na.rm = TRUE) * 4,
   .SDcols = c("PSS1", "PSS2", "PSS3", "PSS4")]

## Let's look at how stressed people were in previous years of PSY4210 in general
summary(db$StressAVG)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 2.25 2.50 2.57 3.00 4.25
## let's try without the reverse-coded items for PSS 2 and 3
## Pay attention to the warnings provided
psych::alpha(as.data.frame(db[, .(PSS1, PSS2r, PSS3r, PSS4)]))
Warning in response.frequencies(x, max = max): response.frequency has been
deprecated and replaced with responseFrequecy. Please fix your call
Warning in psych::alpha(as.data.frame(db[, .(PSS1, PSS2r, PSS3r, PSS4)])): Some items were negatively correlated with the first principal component and probably
should be reversed.
To do this, run the function again with the 'check.keys=TRUE' option
Some items ( PSS1 PSS4 ) were negatively correlated with the first principal component and
probably should be reversed.
To do this, run the function again with the 'check.keys=TRUE' option
## Let's do the same thing with an additional 'check.keys=TRUE' option
psych::alpha(as.data.frame(db[, .(PSS1, PSS2r, PSS3r, PSS4)]), check.keys = TRUE)
Warning in response.frequencies(x, max = max): response.frequency has been
deprecated and replaced with responseFrequecy. Please fix your call
Warning in psych::alpha(as.data.frame(db[, .(PSS1, PSS2r, PSS3r, PSS4)]), : Some items were negatively correlated with the first principal component and were automatically reversed.
This is indicated by a negative sign for the variable name.
## create a categorical stress variable
db[StressAVG < 3, StrCat := "low"]
db[StressAVG >= 3, StrCat := "high"]
db[, StrCat := factor(StrCat, levels = c("low", "high"))]

## wouldn't be a bad idea to also let R know that sex and relsta are factors
db[, relsta := factor(
  relsta,
  levels = c(1, 2, 3),
  labels = c("single", "in a committed exclusive relationship",
             "in a committed nonexclusive relationship"))]
db[, sex := factor(
  sex,
  levels = c(1, 2),
  labels = c("male", "female"))]
3 Grammar of Graphics
ggplot2 is based on the grammar of graphics, a framework for creating graphs.
The idea is that graphics or data visualization generally can be broken down into basic low level pieces and then combined, like language, into a final product.
Under this system, line plots and scatter plots are essentially the same. Both have data mapped to the x and y axes. The difference is whether the plotting symbol (a geometry, labelled geom in R) is a point or a line. The data, axes, labels, titles, etc. may be identical in both cases.
ggplot2 also uses aesthetics, which control how geometries are displayed. For example, the size, shape, colour, transparency level all are aesthetics.
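As a tiny illustration of this idea (using the built-in mtcars data rather than our study data), the same data and mapping yield a scatter plot or a line plot depending only on which geom we add:

```r
library(ggplot2)

## one dataset + one mapping, reused with different geometries
base <- ggplot(mtcars, aes(x = wt, y = mpg))
p_scatter <- base + geom_point()  # point geometry -> scatter plot
p_line    <- base + geom_line()   # line geometry  -> line plot
```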
3.1 Univariate Graphs
To begin with, we will make some simple graphs using ggplot2. The first step is specifying the dataset and mapping variables to axes. For basic, univariate plots such as a histograms, we only need to specify the dataset and what variable is mapped to the x axis. We can re-use this basic setup with different geometries to make different graphs.
pb <-ggplot(data = db, aes(x = Stress))
Note
Histograms define equal width bins on the x axis and count how many observations fall within each bin. Bars display these where the width of the bar is the width of the bin and the height of the bar is the count (frequency) of observations falling within that range. Histograms show a univariate distribution.
Using our basic mapping, we can “add” a histogram geometry to view a histogram.
Note
When you use geom_histogram() or geom_dotplot() you’ll likely get a warning that a default number of bins was used and that you should specify a better binwidth. The binwidth controls how wide the histogram bars are or how wide the dots in the dotplot are. However, for our purposes the default option is almost always good enough and you can generally feel free to ignore this message, unless you really want to change your histogram.
pb +geom_histogram()
A histogram in ggplot2 for stress
## looks like we have a pretty normal distribution based on this
Note
Density plots attempt to provide an empirical approximation of the probability density function (PDF) for data. A probability density function always sums to one (i.e., if you integrated to get the area under the curve, it would always be one). The underlying assumption is that the observed data likely come from some relatively smooth distribution, so typically a smoothing kernel is used so that you see the approximate density of the data, rather than seeing exactly where each data point falls. Density plots show a univariate distribution.
We also can make a density plot, which also attempts to show the distribution, but using a smooth density function rather than binning the data and plotting the frequencies. Like histograms, the height indicates the relative frequency of observations at a particular value. Density plots are designed so that the total area under the curve is one.
pb +geom_density()
A density plot for stress
Note
Dot plots show the raw data. The data values are on the x axis. If two data points would overlap, they are vertically displaced leading to another name: stacked dot plots. They are good for small datasets. The y axis is kind of like a discrete density. Dot plots show a univariate distribution.
Another type of plot are (stacked) dotplots. These are very effective at showing raw data for small datasets. Each dot represents one person. If two dots would overlap, they are stacked on top of each other. While these often are difficult to view with large datasets, for small datasets, they provide greater precision than histograms.
pb +geom_dotplot()
A dot plot for stress
Finally, we can make Q–Q plots, although these map the variable to the sample aesthetic rather than to the x axis. We use the scale() function to z-score the data on the fly. Then we add a line with an intercept (a) and slope (b) using geom_abline(), which is the line all the points would fall on if they were perfectly normally distributed. Remember: z = (x − mean) / SD
# "sample" is z-scores of Stress; geom_qq makes "theoretical" z-scores
# of a standard normal distribution
ggplot(db, aes(sample = scale(Stress))) +
  geom_qq() +
  geom_abline(intercept = 0, slope = 1)
A QQ plot for stress
3.2 Checking Distributions
Note
We do not have to assume a normal distribution. The testDistribution() function supports many other types of distributions. For example this code would test whether the data followed a chi-squared distribution:
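The code for that example did not survive formatting; based on the testDistribution() interface it would look roughly like the sketch below (written here with simulated chi-squared data so it is self-contained, and with a guessed starting value for the degrees of freedom):

```r
library(JWileymisc)

set.seed(1234)
x <- rchisq(200, df = 3)  # simulated data standing in for a real variable

## test (and plot) whether x plausibly follows a chi-squared distribution;
## 'starts' seeds the estimation of the degrees of freedom
td <- testDistribution(x, distr = "chisq", starts = list(df = 3))
plot(td)
```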
The log likelihood value is output for each graph, and this can be used to empirically pick the better fitting distribution: whichever distribution provides the highest log likelihood for the data. We do not get into different distributions too much in this unit, but as you go on, you may find that many variables do not follow a normal distribution, and there are many statistical models that do not require outcome variables to follow a normal distribution.
To see more examples, look at: http://joshuawiley.com/JWileymisc/articles/diagnostics-vignette.html
We often use graphs also to visually assess assumptions, such as normality, and to look for outliers. Most commonly, and certainly in what you have likely learned to date, we will be assuming variables follow a normal distribution. We can use the testDistribution() function combined with plot() to create a plot that helps us examine both the distribution and outliers. There are two parts to this graph. First, it makes a density plot of the raw data as a solid black line. A dashed blue line superimposed shows what a normal distribution would look like with the same mean and standard deviation as the observed data. If these two densities are close, that indicates the variable is approximately normally distributed. The x axis labels are a five-number summary of the data; they show:
Minimum (0th percentile)
1st quartile (25th percentile)
Median (50th percentile)
3rd quartile (75th percentile)
Maximum (100th percentile)
We also get a rug plot (the little vertical lines between the axis and the density plot), drawn wherever there is raw data. That helps us see where the raw data fall.
Below the density plot is a deviates plot. This is like a Q–Q plot, but rotated 45 degrees so that the line is horizontal instead of at an angle. If the dots fall near the line at 0, they fall close to where a theoretical normal distribution would place them.
Finally, if we assume a variable follows a normal distribution, we may use z scores to identify extreme values or outliers. Z scores make sense when we assume a theoretical distribution, like the normal distribution. We can have outliers identified automatically by specifying extremevalues = "theoretical" and then specifying the percentage of the theoretical distribution in each tail we consider extreme. For example ev.perc = .005 means that we consider any score that is in the bottom 0.5% or top 0.5% of a normal distribution with the mean and standard deviation we observed for our data to be an “extreme” value and these will be identified in black while the rest of points are a light grey. There are no fixed guidelines on what threshold to use, many people use the top and bottom 0.1% as well (ev.perc = .001).
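Putting those pieces together, the call would be along these lines (a sketch on simulated data so it runs on its own; with the real data you would pass db$Stress instead of x):

```r
library(JWileymisc)

set.seed(12345)
x <- rnorm(200, mean = 2.5, sd = 0.6)  # simulated stand-in for db$Stress

## flag scores in the extreme bottom/top 0.5% of the matching normal distribution
td <- testDistribution(x, distr = "normal",
                       extremevalues = "theoretical", ev.perc = .005)
plot(td)
```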
When there are multiple univariate distributions to view, density plots are probably one of the most efficient ways. Histograms are difficult to view because they are either stacked, which makes interpretation more difficult; dodged, which is visually hard to read; or overplotted, which can hide some of the data.
We can map additional variables to aesthetics such as colour to include more information. For density plots, separating by colour is easy: just add another variable, say StrCat, sex, or relsta, as an additional aesthetic. For categorical aesthetics like colour, if the variable is not already a factor, it’s a good idea to convert it to a factor first so R knows that it is discrete and orders the levels as we want, not in the default alphabetical order (unless that is what you want).
ggplot(db, aes(Stress, colour = sex)) +geom_density() ## continuous stress scores by sex
Warning: Groups with fewer than two data points have been dropped.
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_density()`).
Density plot coloured by sex (and others) for stress
## To remove the NAs for sex
ggplot(db[!is.na(sex)], aes(Stress, colour = sex)) + geom_density() ## continuous stress scores by sex
Density plot coloured by sex (and others) for stress
For histograms, rather than control the colour of the lines, it is more helpful to control the fill colour. By default, overlapping bars are stacked on top of each other. So that there is not an NA group, we remove anyone who is missing sex.
ggplot(db[!is.na(sex)], aes(Stress, fill = sex)) +geom_histogram()
Histogram coloured by sex for stress
Overlapping bars also can be dodged instead of stacked.
ggplot(db[!is.na(sex)], aes(Stress, fill = sex)) +geom_histogram(position ="dodge")
Dodged histogram coloured by sex for stress
4 Bivariate Graphs
We can make bivariate plots by mapping variables to both the x and y-axis. For a scatter plot, we use point geometric objects.
We also can use lines for bivariate data. For this example, we will calculate the average mood and energy by day in the daily dataset and save this as a new, small dataset. Compared to a scatter plot, we only change point to line geometric objects.
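That small summary dataset could be created along these lines. This is a sketch with simulated daily data standing in for dd, since the exact contents of the real daily file may differ; the variable names SurveyDay, mood, and energy match those used in the plotting code below.

```r
library(data.table)

## simulated stand-in for the daily dataset dd
set.seed(1)
dd <- data.table(SurveyDay = rep(as.Date("2020-03-01") + 0:6, each = 10),
                 mood = rnorm(70, mean = 3, sd = 1),
                 energy = rnorm(70, mean = 3, sd = 1))

## average mood and energy by day, saved as a new, small dataset
dsum <- dd[, .(mood = mean(mood, na.rm = TRUE),
               energy = mean(energy, na.rm = TRUE)),
           by = SurveyDay]
```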
With a relatively discrete x axis, we can use a barplot for bivariate data. By default, geom_bar() calculates the count of observations that occur at each x value, so if we want our values to be the actual bar height, we set stat = "identity".
The grammar of graphics is designed to be like sentences, where you can add or modify easily. For example, “There is a ball.” and “There is a big, red, striped ball.” are both valid sentences. So too with graphics: we often chain pieces together to make the plot more nuanced. In R we just “add” more by separating each component with +. Note that most argument names (i.e., data =, mapping =) are not strictly required; R will often match input to the correct argument by position. The two sets of code below yield the same plot. In the first, we explicitly label all arguments; in the second, we rely on positional matching. Positional matching does not always work. For example, we still must specify size = because we don’t always provide input for every argument, instead relying on defaults and only changing the specific arguments we want changed from the defaults.
# Create a numeric version of Survey Day in dsum and name it nday
dsum[, nday := as.numeric(SurveyDay)]
# Create a factor version of Survey Day in dsum and name it fday
dsum[, fday := as.factor(SurveyDay)]

# Now let's create a pretty plot with colours! And with some lines and points.
ggplot(data = dsum,
       mapping = aes(x = nday, y = energy)) +
  geom_bar(mapping = aes(fill = nday),
           stat = "identity") +
  geom_line(size = 2) +
  geom_point(size = 6)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Bar plot of day and energy
## note the continuous colour scale when your x is continuous
# Now let's do the same thing but with the factor version of SurveyDay
ggplot(data = dsum,
       mapping = aes(x = fday, y = energy)) +
  geom_bar(mapping = aes(fill = fday),
           stat = "identity") +
  geom_line(size = 2) +
  geom_point(size = 6)
Bar plot of day and energy
# Alternatively you can also present your code as follows
ggplot(dsum, aes(fday, energy)) +
  geom_bar(aes(fill = fday), stat = "identity") +
  geom_line(size = 2) +
  geom_point(size = 6)
Bar plot of day and energy
# If you want to be even more extra (note use of both fday and nday)
ggplot(dsum, aes(nday, energy)) +
  geom_bar(aes(fill = fday), stat = "identity") +
  geom_line(size = 5, colour = "green") +
  geom_point(size = 3, colour = "orange")
Bar plot of day and energy
5 Improving Data Visualization
Before we continue examining graphs, it is helpful to think a bit more about what makes a good graph or good data visualization.
Edward Tufte and William Cleveland are two authors who have written extensively on data visualization and how to make good graphs. Their work is well worth reading to improve your understanding of how to efficiently convey data graphically.
One key principle that Tufte emphasises is the data to ink ratio. This ratio is how much data is conveyed versus ink used, and Tufte argues to try to maximize this (i.e., more data, less ink). To see this, consider the following graph, which is a fairly standard way stats programs (Excel, etc.) tend to make barplots.
Next, think about what data are conveyed in this graph. The bars capture two pieces of information: (1) the day and (2) the energy on that day. The only pieces of the bars we really need are the top. The rest of the bars take up a lot of ink, but convey no data. Points can do this more efficiently. The chart that follows has a much higher data-to-ink ratio as it is stripped back nearly to just the data.
Depending on the number of data points, one may push a bit further. Many people in practice find they are unfamiliar with these sorts of graphs, and at first they can take a bit longer to read. We are trained and used to seeing plots with “chartjunk” and low data-to-ink ratios. However, a chart like this is a far more condensed display of data and removes distractions to really highlight the raw data or results.
If we want something like axes, but more useful, we can use the geom_rangeframe() function from the ggthemes package. Range frames add “axes” that span only the range of the observed data. Thus, these new axes show the minimum and maximum of each variable.
These informative axes are created by asking ggplot2 to place tick marks (breaks) on the x and y axes at the quartiles of the variables, which is the default output from the quantile() function. So now, instead of the default tick marks / breaks on the axes, the breaks and numbers are: minimum (0th percentile), 25th percentile (i.e., lower quartile), 50th percentile (i.e., median), 75th percentile (i.e., upper quartile), and maximum (i.e., 100th percentile). This provides useful descriptive statistics on each variable in the plot, right in the axes.
Finally, we can make the axis labels more informative. Instead of presenting “pretty” numbers that convey no data, we can pick axis labels and breaks at meaningful points of the data. One option is quantiles / percentiles: the 0th, 25th, 50th (median), 75th, and 100th percentiles are given by default from the quantile() function. Now almost every piece of ink in this figure conveys some useful information. We can visually see the range of the stress and self-esteem variables from the axes. We can see the median and interquartile range as well.
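Pulling these ideas together, a high data-to-ink scatter plot can be sketched like this (simulated values stand in for the real stress and self-esteem scores, so the code runs on its own):

```r
library(ggplot2)
library(ggthemes)  # geom_rangeframe(), theme_tufte()

## simulated stand-ins for the stress and self-esteem variables
set.seed(42)
d <- data.frame(Stress = rnorm(100, 10, 2), SelfEsteem = rnorm(100, 3.8, 0.5))

p <- ggplot(d, aes(Stress, SelfEsteem)) +
  geom_point() +
  geom_rangeframe() +  # "axes" spanning only the observed data range
  ## breaks at the five-number summary instead of "pretty" numbers
  scale_x_continuous(breaks = as.numeric(round(quantile(d$Stress), 1))) +
  scale_y_continuous(breaks = as.numeric(round(quantile(d$SelfEsteem), 1))) +
  theme_tufte()
```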
In the rest of the graphs we examine, we will try to implement this data to ink ratio principle. It does require some additional R code versus simply plotting with the defaults and often at first, you may need to spend a bit more time explaining your graph to people in text or in the figure legend. However, ultimately, such plots convey more data.
6 Advanced Bivariate Graphs
As with univariate graphs, we can map additional variables to additional aesthetics. This allows us to integrate more into standard bivariate plots. The following graph shows a simple example of this.
There are many other shapes available. To see a list, go to: http://sape.inf.usi.ch/quick-reference/ggplot2/shape.
However, although the shapes are distinguishable, they can be visually hard to tell apart. Cleveland did some studies on shape perception, particularly when points may partially overlap, and on this basis suggested other shapes. We can manually specify the number of the shape we want applied to each value of sex using the scale_shape_manual() function. We also use the name = argument so that instead of the legend being labelled sex, it is labelled Sex (yes, with a capital S).
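A self-contained sketch of that, using simulated data; shapes 1 and 3 (an open circle and a plus) are examples of shapes that stay distinguishable even when points overlap:

```r
library(ggplot2)

## simulated data with a two-level grouping variable standing in for sex
set.seed(7)
d <- data.frame(x = rnorm(60), y = rnorm(60),
                sex = factor(rep(c("male", "female"), 30)))

p <- ggplot(d, aes(x, y, shape = sex)) +
  geom_point(size = 3) +
  ## name = relabels the legend; values = assigns a shape number per level
  scale_shape_manual(name = "Sex", values = c(male = 1, female = 3))
```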
Many more examples can be found online, for example: http://www.cookbook-r.com/Graphs/
Here we are going to put together several of the ideas learned to make some plots that could be included in presentations or publications. We will go through these fairly briefly and they serve largely as a “cookbook” with some examples you may want to use yourself later.
## view the labels for one of the ULS items
attr(db$ULS1, "labels")
Never Rarely Sometimes Often
1 2 3 4
## make a summarised dataset with the means and labels
sumdat <- melt(db[, .(ID,
  LackCompanionship = ULS1, LeftOut = ULS4,
  Isolated = ULS5, NoOneToTurnTo = ULS2)],
  id.vars = "ID")[, .(Mean = mean(value, na.rm = TRUE)), by = variable]
sumdat[, Never := paste0(variable, "\nNever")]
sumdat[, Often := paste0(variable, "\nOften")]

## make a likert plot
gglikert("Mean", "variable", "Never", "Often", data = sumdat,
  xlim = c(1, 5),
  title = "Average Loneliness Ratings")
Likert plot of means showing anchors
8 Summary Table for Data visualisation
Here is a little summary of some of the functions used so far. You might also enjoy this “cheatsheet” for ggplot2: https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf
Function
What it does
ggplot()
Sets the dataset and which variables map to which aesthetics for a plot
geom_boxplot()
Adds geometric object for boxplots
geom_density()
Adds a geometric object for density lines, a way to view a distribution
geom_histogram()
Adds a geometric object for a histogram, a way to view a distribution
geom_jitter()
Adds points with some automatic random noise, helpful when one axis is discrete
stat_summary()
Used to automatically calculate some summary statistics on data and plot them, usually means with standard errors or confidence intervals
gglikert()
Creates a Likert-type plot showing means, typically with the scale anchors
plot(testDistribution())
Used to check whether a variable follows a specific, assumed distribution, typically a normal distribution
ylab()
Adds a label for the y axis
xlab()
Adds a label for the x axis
9 More advanced data visualisation - Categorical and Continuous variables
## compute the personality variables if you haven't already done so
db[, BFI_N1 := 6 - BFI_N1r]
db[, neuroticism := rowMeans(.SD, na.rm = TRUE), .SDcols = c("BFI_N1", "BFI_N2")]
db[, BFI_E1 := 6 - BFI_E1r]
db[, extraversion := rowMeans(.SD, na.rm = TRUE), .SDcols = c("BFI_E1", "BFI_E2")]
db[, BFI_O1 := 6 - BFI_O1r]
db[, openness := rowMeans(.SD, na.rm = TRUE), .SDcols = c("BFI_O1", "BFI_O2")]
db[, BFI_C1 := 6 - BFI_C1r]
db[, conscientiousness := rowMeans(.SD, na.rm = TRUE), .SDcols = c("BFI_C1", "BFI_C2")]
db[, BFI_A2 := 6 - BFI_A2r]
db[, agreeableness := rowMeans(.SD, na.rm = TRUE), .SDcols = c("BFI_A1", "BFI_A2")]

## create some binary variables (based on the median - will see another way to create based on median later)
db[, StressHigh := factor(Stress >= 10, levels = c(TRUE, FALSE),
                          labels = c("High Stress", "Low Stress"))]
db[, SelfesteemHigh := factor(SE >= 3.8, levels = c(TRUE, FALSE),
                              labels = c("High SE", "Low SE"))]
10 All Categorical Variables
If we are working with all categorical variables, a common way to present them is to make one variable “continuous” by calculating the percentages. For example, suppose that in our baseline data collection exercise data, we wanted to graph the association between categorical self-esteem and sex.
egltable("SelfesteemHigh", g ="sex", data = db)
male N (%) female N (%)
<char> <char> <char>
1: SelfesteemHigh
2: High SE 8 (36.4%) 51 (42.9%)
3: Low SE 14 (63.6%) 68 (57.1%)
Test
<char>
1: Chi-square = 0.32, df = 1, p = .571, Phi = 0.05
2:
3:
10.1 Bar Plot
We can get a frequency table easily enough (as shown in the earlier code), but what if we wanted to graph it?
We could graph the frequencies, such as with a bar plot.
p.bar <- ggplot(db[!is.na(sex)], aes(sex, fill = SelfesteemHigh)) +
  geom_bar(position = "dodge")
print(p.bar)
Beyond the data to ink ratio issues, if we were presenting this or putting it into a paper, we would want to label it more cleanly. Here we remove the x-axis title, since saying male and female makes it clear enough we don’t need to say “sex” is the variable name anymore. We could relabel the y axis (although count is fairly clear, I wanted to show how to do it). The theme cleans up and makes the font a bit bigger.
You can find more about customizing guides here: https://ggplot2.tidyverse.org/reference/guide_legend.html
We can change SelfesteemHigh by using scale_fill_manual(), which lets us name the title of the legend and specify the colours, by name or hexadecimal code, for each group to make it black and white. We use the coord_cartesian() function to stop the axis expansion so that it begins exactly at zero. Finally, we add a title with math symbols using the ggtitle() function, listing the chi-square p-value from the egltable() analysis earlier.
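A self-contained sketch of those polishing steps (the counts are rebuilt from the egltable() output above; the colours and plotmath title text are illustrative choices, not fixed requirements):

```r
library(ggplot2)

## rebuild the sex x self-esteem counts from the frequency table
d <- data.frame(
  sex = factor(rep(c("male", "female"), times = c(22, 119))),
  SelfesteemHigh = factor(c(rep(c("High SE", "Low SE"), times = c(8, 14)),
                            rep(c("High SE", "Low SE"), times = c(51, 68)))))

p <- ggplot(d, aes(sex, fill = SelfesteemHigh)) +
  geom_bar(position = "dodge") +
  xlab("") +  # "male"/"female" make an axis title redundant
  ylab("Count of People") +
  scale_fill_manual("Self-Esteem",
                    values = c("High SE" = "black", "Low SE" = "grey70")) +
  coord_cartesian(expand = FALSE) +  # bars start exactly at zero
  ggtitle(expression(paste(chi^2, " = 0.32, ", italic(p), " = .571"))) +
  theme_minimal(base_size = 14)
```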
A simple way would be to calculate the percentage of sex == "female" in each self-esteem category and plot that. We would create a new dataset with percentages calculated along with confidence intervals using the following code. We calculate the mean of sex == "female", which is the proportion, and then use the prop.test() function, which takes the count in one group and the total count, to get 95% confidence intervals for the proportions.
SelfesteemHigh Percent LL UL
<fctr> <num> <num> <num>
1: Low SE 0.829 0.734 0.895
2: High SE 0.864 0.755 0.930
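The numbers in a table like this come from a proportion plus a prop.test() confidence interval. For example, the Low SE row (68 of the 82 people in that group are female, per the counts earlier) can be reproduced directly:

```r
## 68 of the 82 people in the Low SE group are female
prop <- 68 / 82
ci <- prop.test(x = 68, n = 82, correct = FALSE)$conf.int

round(c(Percent = prop, LL = ci[1], UL = ci[2]), 3)
## Percent 0.829, LL 0.734, UL 0.895 -- matching the Low SE row above
```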
Now we can plot the results. All we need for a basic plot is the ggplot() and geom_pointrange() but the rest helps polish up the figure for presentation.
The same basic strategy can work for many variables at scale fairly easily. For example, suppose that the personality dimensions were all categorical. We first create these categorical variables, noting that this is solely for the sake of demonstration. In general it is not a good idea to convert continuous variables to categorical ones.
Next, we select just the variables we want (ID, self esteem group, and the personality variables) and reshape the dataset to long format, where each personality variable plays the role of a "timepoint". This lets data.table calculate the proportions easily, grouping by self esteem category and by personality variable. The resulting dataset has the proportion and confidence intervals of scoring high on each personality measure for each self esteem group.
db[, O := as.integer(openness > median(openness, na.rm = TRUE))]
db[, C := as.integer(conscientiousness > median(conscientiousness, na.rm = TRUE))]
db[, E := as.integer(extraversion > median(extraversion, na.rm = TRUE))]
db[, A := as.integer(agreeableness > median(agreeableness, na.rm = TRUE))]
db[, N := as.integer(neuroticism > median(neuroticism, na.rm = TRUE))]

# Recall the content on reshaping from wide to long format from Lecture 2 (WorkData.rmd)
dblong <- reshape(
  db[!is.na(SelfesteemHigh), .(ID, SelfesteemHigh, O, C, E, A, N)],
  varying = list(Score = c("O", "C", "E", "A", "N")),
  v.names = "Score",
  timevar = "Personality",
  times = c("O", "C", "E", "A", "N"),
  idvar = "ID",
  direction = "long")

propdata2 <- dblong[, .(
  Percent = mean(Score == 1, na.rm = TRUE),
  LL = prop.test(x = sum(Score == 1, na.rm = TRUE),
                 n = sum(!is.na(Score)), correct = FALSE)$conf.int[1],
  UL = prop.test(x = sum(Score == 1, na.rm = TRUE),
                 n = sum(!is.na(Score)), correct = FALSE)$conf.int[2]),
  by = .(SelfesteemHigh, Personality)]

propdata2[, Personality := factor(Personality, levels = c("O", "C", "E", "A", "N"))]

print(propdata2)
SelfesteemHigh Personality Percent LL UL
<fctr> <fctr> <num> <num> <num>
1: Low SE O 0.427 0.325 0.535
2: High SE O 0.390 0.276 0.517
3: Low SE C 0.232 0.154 0.334
4: High SE C 0.458 0.337 0.583
5: Low SE E 0.280 0.195 0.386
6: High SE E 0.542 0.417 0.663
7: Low SE A 0.378 0.281 0.486
8: High SE A 0.593 0.466 0.709
9: Low SE N 0.500 0.394 0.606
10: High SE N 0.441 0.322 0.567
Now we can plot the results. All we need for a basic plot is ggplot(), geom_pointrange(), and facet_grid(), but the rest helps polish the figure for presentation. Note that we have not seen facet_grid() before. Faceting is an idea in data visualization of making "small multiples": essentially the same plot repeated, with some change each time. In this case, it is the same plot repeated, changing only the personality measure.
p.prop2 <- ggplot(propdata2, aes(SelfesteemHigh, y = Percent, ymin = LL, ymax = UL)) +
  geom_pointrange() +
  scale_y_continuous(labels = percent) +
  scale_x_discrete(breaks = c("High SE", "Low SE"),
                   labels = c("High SE (median)", "Low SE (median)")) +
  xlab("") +
  ylab("Percent High (95% CI)") +
  theme_pubr() +
  facet_grid(Personality ~ .) + # when we created propdata2 in the chunk above, we told it to make a factor called Personality with O, C, E, A, N levels
  coord_flip()

print(p.prop2)
11 All Continuous Variables
When all the variables are continuous, there are not many graphing options: a scatter plot is the main way two continuous variables are visualized. However, even with a scatter plot, we can add information to make it more useful. Our starting point is the scatter plots we went over earlier. To that we add a linear regression line to show the overall association between the two variables. We also add a text annotation with the correlation coefficient and p-value, obtained by running the cor.test() function. We label the axes using xlab() and ylab(). The main plot is saved in an object, p.ss. Finally, we use the ggMarginal() function from the ggExtra package to add histograms of the univariate distributions to the margins. The final result captures extensive information about the individual variables (through the marginal histograms and the five number summaries used as axis breaks) and about their association (through the scatter plot, regression line, and correlation coefficient).
cor.test(~ SE + Stress, data = db)
Pearson's product-moment correlation
data: SE and Stress
t = -10, df = 140, p-value <2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.733 -0.539
sample estimates:
cor
-0.646
p.ss <- ggplot(db, aes(Stress, SE)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE, size = 1) +
  scale_x_continuous(breaks = as.numeric(quantile(db$Stress))) +
  scale_y_continuous(breaks = as.numeric(quantile(db$SE))) +
  theme_pubr() +
  theme(axis.line = element_blank()) +
  geom_rangeframe() +
  xlab("Perceived Stress") +
  ylab("Self Esteem") +
  annotate("text", x = max(db$Stress), y = max(db$SE),
           label = "r = -0.65, p < .001",
           size = 6, hjust = 1, vjust = 1)

## now add a histogram to the margins
ggMarginal(p.ss, type = "histogram")
scatter plot with regression line and correlation
Note
Because we did not save the version of the Stress and self esteem plot with the marginal histograms back into p.ss, the panel figure below shows the basic scatter plot. If we wanted the histograms as well, we could assign the ggMarginal() result to p.ss.
Another helpful feature is arranging sets of related figures together. For example, in the following code we make a plot of Stress and neuroticism scores and save it in p.sn. Then we can make a panel of graphs with two columns using the ggarrange() function.
cor.test(~ neuroticism + Stress, data = db)
Pearson's product-moment correlation
data: neuroticism and Stress
t = 7, df = 139, p-value = 7e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.354 0.607
sample estimates:
cor
0.49
# note I had to add na.rm argument to the quantile and min functions for neuroticism
p.sn <- ggplot(db, aes(Stress, neuroticism)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE, size = 1) +
  scale_x_continuous(breaks = as.numeric(quantile(db$Stress))) +
  scale_y_continuous(breaks = as.numeric(quantile(db$neuroticism, na.rm = TRUE))) +
  theme_pubr() +
  theme(axis.line = element_blank()) +
  geom_rangeframe() +
  xlab("Perceived Stress") +
  ylab("Neuroticism") +
  annotate("text", x = max(db$Stress), y = min(db$neuroticism, na.rm = TRUE),
           label = "r = 0.49, p < .001",
           size = 6, hjust = 1, vjust = 0)

ggarrange(
  p.ss, p.sn,
  ncol = 2,
  labels = c("A", "B"))
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
panel graph of two scatter plots
12 Mixed Continuous and Categorical Variables
The greatest variety of graphs is possible with data that mix continuous and categorical variables. As an example, we will work with self esteem group, sex, and the five personality traits. We start by reshaping the data long, so that which personality measure is being examined becomes another categorical variable.
ID SelfesteemHigh sex Personality Score
<num> <fctr> <fctr> <fctr> <num>
1: NA Low SE female O 4.0
2: 10 Low SE female O 3.5
3: 11 Low SE male O 3.0
4: 12 High SE female O 4.0
5: 13 Low SE female O 3.0
6: 14 Low SE female O 2.0
We can make a simple plot with the means and 95% confidence intervals using the following code.
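The chunk itself is not shown. A sketch under the assumption that it uses stat_summary() with mean_cl_normal, as the later faceted version in this section does, with a simulated stand-in for `dblong2` (mean_cl_normal needs the Hmisc package installed at draw time):

```r
library(ggplot2)

## simulated stand-in for `dblong2` (long format: one row per person x trait)
set.seed(1)
dblong2 <- data.frame(
  Personality = factor(rep(c("O", "C", "E", "A", "N"), each = 60)),
  Score = sample(seq(1, 5, by = 0.5), 300, replace = TRUE))

p.mean <- ggplot(dblong2, aes(Personality, Score)) +
  stat_summary(fun.data = mean_cl_normal) +  # mean and normal-theory 95% CI
  scale_x_discrete("",
    breaks = c("O", "C", "E", "A", "N"),
    labels = c("Openness", "Conscientiousness", "Extraversion",
               "Agreeableness", "Neuroticism")) +
  coord_flip()  # rotate so the long trait names read left to right

## print(p.mean) to render
```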
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_summary()`).
Removed 5 rows containing non-finite outside the scale range
(`stat_summary()`).
Rotating either the labels or the graph makes long labels easier to read. Here rotating the graph works especially well, because the longest words can then be read in their usual left to right orientation.
With smaller datasets, we could show the raw data as well as the means. Although relatively easy to add, the result here is disastrous. Even though this is not a huge dataset, there is enough data, and because there are few distinct possible values for each personality measure, the dotplot is very difficult to read. Worse, because the raw data and the means are both shown as black dots, it becomes nearly impossible to see the means.
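The chunk is not shown; geom_dotplot() is an assumption here, consistent with the stat_bindot warnings that follow. A sketch with simulated stand-in data for `dblong2`:

```r
library(ggplot2)

## simulated stand-in for `dblong2`
set.seed(1)
dblong2 <- data.frame(
  Personality = factor(rep(c("O", "C", "E", "A", "N"), each = 60)),
  Score = sample(seq(1, 5, by = 0.5), 300, replace = TRUE))

p.raw <- ggplot(dblong2, aes(Personality, Score)) +
  geom_dotplot(binaxis = "y", stackdir = "center") +  # raw data as stacked dots
  stat_summary(fun.data = mean_cl_normal) +           # means drawn over the dots
  coord_flip()

## print(p.raw) to render; the means are hard to see among the raw-data dots
```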
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_summary()`).
Warning: Removed 5 rows containing missing values or values outside the scale range
(`stat_bindot()`).
By shrinking the size of the dots (using binwidth = .1), adding transparency with the alpha = .2 argument (valid values run from 0, completely transparent, to 1, completely opaque), and adding some random noise to the scores using jitter(), we can see the raw data and the means better. It is no longer exactly the raw data, because we have added noise, but it can still show the general spread of the data.
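Again the chunk is not shown; a sketch of one way to combine those three adjustments, with simulated stand-in data (the exact placement of jitter() is an assumption):

```r
library(ggplot2)

## simulated stand-in for `dblong2`
set.seed(1)
dblong2 <- data.frame(
  Personality = factor(rep(c("O", "C", "E", "A", "N"), each = 60)),
  Score = sample(seq(1, 5, by = 0.5), 300, replace = TRUE))

p.jit <- ggplot(dblong2, aes(Personality, jitter(Score))) +  # noise on the scores
  geom_dotplot(binaxis = "y", stackdir = "center",
               binwidth = .1,   # smaller dots
               alpha = .2) +    # semi-transparent so overlaps are visible
  stat_summary(aes(y = Score), fun.data = mean_cl_normal) +  # means on un-jittered scores
  ylab("Score") +
  coord_flip()

## print(p.jit) to render
```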
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_summary()`).
Warning: Removed 5 rows containing missing values or values outside the scale range
(`stat_bindot()`).
For larger datasets, or if the dotplot with noise becomes less useful because it is no longer the true raw data, a more summarized view of the distribution can be shown with a violin plot, which is basically a density plot that is mirrored. Thicker regions have more data points; narrower regions have fewer. We can also see the range and spread of each variable while still showing our mean and confidence interval summaries clearly.
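The chunk is not shown; a sketch of the violin version, with simulated stand-in data for `dblong2`:

```r
library(ggplot2)

## simulated stand-in for `dblong2`
set.seed(1)
dblong2 <- data.frame(
  Personality = factor(rep(c("O", "C", "E", "A", "N"), each = 60)),
  Score = sample(seq(1, 5, by = 0.5), 300, replace = TRUE))

p.vio <- ggplot(dblong2, aes(Personality, Score)) +
  geom_violin() +                            # mirrored density of the raw scores
  stat_summary(fun.data = mean_cl_normal) +  # mean and 95% CI drawn on top
  coord_flip()

## print(p.vio) to render
```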
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_summary()`).
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_ydensity()`).
Another aspect we could improve: there is no inherent ordering of the personality measures. Ordering them, for example from highest to lowest mean, can help us read the plot more easily.
We do this by changing the levels of the factor in the dataset. First, we calculate the mean score by personality; then we have data.table order the resulting means from highest to lowest (operations we saw in the working with data topic). Finally, we use factor() on Personality and specify the levels in this same order.
dblong2[, .(M =mean(Score, na.rm =TRUE)), by = Personality][order(-M)]
Personality M
<fctr> <num>
1: C 3.75
2: O 3.59
3: A 3.57
4: N 3.54
5: E 3.07
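The factor() re-levelling step itself is not shown above; a sketch of it, using a simulated stand-in for `dblong2`:

```r
library(data.table)

## simulated stand-in for `dblong2`
set.seed(1)
dblong2 <- data.table(
  Personality = factor(rep(c("O", "C", "E", "A", "N"), each = 60)),
  Score = sample(seq(1, 5, by = 0.5), 300, replace = TRUE))

## means by trait, ordered from highest to lowest
meanorder <- dblong2[, .(M = mean(Score, na.rm = TRUE)), by = Personality][order(-M)]

## reset the factor levels to that order; later plots then follow it
dblong2[, Personality := factor(Personality,
                                levels = as.character(meanorder$Personality))]

levels(dblong2$Personality)  # highest-mean trait first
```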
Now we can simply remake our graph (note that this shortcut works because data.table modifies the dataset in place, so the stored plot picks up the new factor levels; if you are using data frames, you would need to copy and re-run all of your graph code).
Ordering the variables by their means helps us rapidly see which has the lowest average score (extraversion) and which the highest (conscientiousness). It is a small step, but one that aids rapid processing of the figure and the data in it.
print(p.mean3)
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_summary()`).
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_ydensity()`).
It is easy to add additional categorical variables into a figure. For example, we can colour by self esteem group.
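The chunk is not shown; a sketch of the coloured version, modelled on the faceted code later in this section but without the facet, with simulated stand-in data for `dblong2`:

```r
library(ggplot2)

## simulated stand-in for `dblong2` with a self esteem grouping
set.seed(1)
dblong2 <- data.frame(
  Personality = factor(rep(c("O", "C", "E", "A", "N"), each = 60)),
  SelfesteemHigh = factor(rep(c("Low SE", "High SE"), 150)),
  Score = sample(seq(1, 5, by = 0.5), 300, replace = TRUE))

p.col <- ggplot(dblong2, aes(Personality, Score, colour = SelfesteemHigh)) +
  stat_summary(fun.data = mean_cl_normal,
               position = position_dodge(.3)) +  # dodge so groups do not overlap
  scale_colour_manual("Self Esteem Group",
    values = c("High SE" = "black", "Low SE" = "grey80")) +
  coord_flip()

## print(p.col) to render
```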
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_summary()`).
If having four means side by side is too hard to read, we could facet the plot into small multiples, say by sex, so that we can compare self esteem groups in male and female participants separately.
# note the use of na.omit at the beginning of ggplot. This is because there is NA data in 'sex'
# and we don't want a separate facet for sex = NA
ggplot(na.omit(dblong2), aes(Personality, Score, colour = SelfesteemHigh)) +
  stat_summary(fun.data = mean_cl_normal, position = position_dodge(.3)) +
  scale_x_discrete("",
    breaks = c("O", "C", "E", "A", "N"),
    labels = c("Openness", "Conscientiousness", "Extraversion",
               "Agreeableness", "Neuroticism")) +
  theme_pubr() +
  coord_flip() +
  facet_grid(sex ~ .) +
  scale_colour_manual("Self Esteem Group",
    values = c("High SE" = "black", "Low SE" = "grey80"))
If we wanted to show the raw data, we could facet on both self esteem and sex and add our dotplots back in.
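A sketch of what that might look like, combining facet_grid() on both grouping variables with the dotplot layer from earlier (simulated stand-in data for `dblong2`; the exact layout is an assumption):

```r
library(ggplot2)

## simulated stand-in for `dblong2` with self esteem group and sex
set.seed(1)
dblong2 <- data.frame(
  Personality = factor(rep(c("O", "C", "E", "A", "N"), each = 60)),
  SelfesteemHigh = factor(rep(c("Low SE", "High SE"), 150)),
  sex = factor(rep(c("female", "male"), each = 150)),
  Score = sample(seq(1, 5, by = 0.5), 300, replace = TRUE))

p.facet2 <- ggplot(na.omit(dblong2), aes(Personality, Score)) +
  geom_dotplot(binaxis = "y", stackdir = "center",
               binwidth = .1, alpha = .2) +   # raw data as small transparent dots
  stat_summary(fun.data = mean_cl_normal) +   # means and 95% CIs on top
  coord_flip() +
  facet_grid(SelfesteemHigh ~ sex)  # one small multiple per group combination

## print(p.facet2) to render
```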